ABSTRACT

With some procedural differences, this study replicated an earlier
study designed to develop algorithms for converting scores on the
Scholastic Aptitude Test (SAT) to those on the Prueba de Aptitud
Academica (PAA) scale and vice versa. The study involved selection of
test items equally appropriate and useful for English- and
Spanish-speaking students for use as an anchor test and the equating
analysis itself. Once the items were selected, they were administered
as pretests, one for each language, to determine whether the two
response functions for each item were sufficiently similar for the
items to be considered equivalent. On the basis of these analyses, 39
verbal and 25 mathematical items were selected for use as anchor
items for equating. The anchor tests were administered at regularly
scheduled administrations of the SAT and PAA. An item response theory
model was used to equate the two tests. The equating itself showed
curvilinear relations in both verbal and mathematical tests,
indicating that, in this instance, both sections of the PAA are
easier than the corresponding SAT sections. Differences between these
findings and those of the previous study by W. H. Angoff and C. C.
Modu (1973) are assessed. Six graphs and six data tables are
provided. (TJH)
ABSTRACT

The present study is a replication, in certain important
respects, of an earlier study conducted by Angoff and
Modu (1973) to develop algorithms for converting
scores expressed on the College Board Scholastic Aptitude
Test (SAT) scale to scores expressed on the College
Board Prueba de Aptitud Academica (PAA) scale,
and vice versa. Because the purpose and the design of
the studies, though not all of the psychometric procedures,
were identical in the two studies, the language
of this report often duplicates that of the earlier study.
The differences in procedure, however, are worth noting,
and it is hoped that this study will contribute in
substance and method to the solution of this important
problem.

The study described in this report was undertaken
in an effort to establish score equivalences between two
College Board tests: the Scholastic Aptitude Test
(SAT) and its Spanish-language equivalent, the Prueba
de Aptitud Academica (PAA). The method involved two
phases: (1) the selection of test items equally appropriate
and useful for English- and Spanish-speaking students
for use as an anchor test in equating the two tests;
and (2) the equating analysis itself. The first phase
called for choosing a set of items in each of the two
languages, translating each item into the other language,
"back-translating" independently into the original
language, and comparing the twice-translated versions
with their originals. This process led to the adjustment
of the translations in several instances and, in
other instances, to the elimination of some items considered
too difficult to be translated adequately. At this
point both sets of "equivalent" items, each in its original
language mode, were administered as pretests,
chiefly to determine whether the two response functions
for each item were sufficiently similar for the
items to be considered equivalent.

On the basis of these analyses two sets of items,
one verbal and the other mathematical, were selected
for use as anchor items for equating. These were administered
again (in the appropriate language) at regularly
scheduled administrations of the SAT and the PAA. An
item response theory (IRT) model was used to equate
the PAA to the SAT, with the anchor items serving as the
link in the equating process.

The equating itself showed definite curvilinear relationships
in both verbal and mathematical tests, indicating
in this instance that both sections of the PAA are
easier than the corresponding SAT sections. The results
also showed good agreement between the current conversions
and the 1973 Angoff-Modu conversions for the
mathematical tests, but not so close agreement for the
verbal tests. The reasons for the difference are (speculatively)
attributed to improved methodology in the present
study, especially for the more difficult verbal equating,
and to the possibility of scale drift in one or the
other test (or both tests) over the intervening 12 to 15
years since the last study.

INTRODUCTION 

Although the study of cultural differences has been of 
central interest to educators and social psychologists for 
many years, attempts to develop a deeper understanding 
of such differences have been frustrated by the absence 
of a common metric by which many comparisons could 
be made. The reasons for this are clear. If two cultural 
groups differ from each other in certain ways that cast 
doubt on the validity of direct comparisons between 
them in other respects — if, for example, they differ in 
language, customs, and values — then those very differ- 
ences also defy the construction of an unbiased metric by 
which we could hope to make such comparisons. 

We find, however, that there are times when comparisons
are nevertheless made, even though the basic
differences in language, customs, and values, for example,
which sometimes invalidate these comparisons, are
known to exist. The present study has been designed in
an attempt to develop a method to help make such
comparisons in the face of these difficulties by providing
a common metric. Specifically, it purports to provide
a conversion of the verbal and mathematical scores
on the College Board Spanish-language Prueba de
Aptitud Academica (PAA) to the verbal and mathematical
scores, respectively, on the College Board English-language
Scholastic Aptitude Test (SAT). Both tests, it
is noted, are administered to secondary school students
for admission to college. The PAA is typically administered
to Puerto Rican students who are planning to
attend colleges and universities in Puerto Rico; the SAT
is typically administered to mainland students who are
planning to attend colleges and universities in the
United States. It was expected that if conversion tables
between the score scales for these two tests were made
available, direct comparisons could be made between
subgroups of the two language-cultures who had taken
only that test appropriate for them. For the immediate
purpose, however, it was expected that these conversion
tables would help in the evaluation of the probable
success of Puerto Rican students who were interested in
eventually attending college on the mainland and were
submitting PAA scores for admission. As already indicated
in the Abstract, the study was conducted in an
effort to repeat the earlier study by Angoff and Modu,
but with some modifications and improvements in
method, and to confirm that the earlier results are still
valid.

Interest in developing conversions such as these
has been expressed in various other contexts, usually in
the assessment of the outcomes of education for different
cultural groups living in close proximity: for example,
for English- and French-speaking students in Canada,
for English- and Afrikaans-speaking students in
South Africa, for speakers of one or another of the
many languages in India or in Africa. No satisfactory
methods to satisfy this interest have been available until
recently, however, and the problems attendant on making
comparisons among culturally different groups are
far more obvious and numerous than are the solutions.
For example, to provide a measuring instrument to
make these comparisons, it is clearly insufficient simply
to translate the test constructed for one language group
into the language of the other, even with adjustments in
the items to conform to the more obvious cultural requirements
of the second group. It can hardly be expected,
without careful and detailed checks, that the
translated items will have the same meaning and relative
difficulty for the second group as they had for the
original group before translation.

A method considerably superior to that of simple 
translation has been described by Boldt (1969). It re- 
quires the selection of a group of individuals judged to 
be equally bilingual and bicultural and the administra- 
tion of two tests to each individual, one test in each of 
the two languages. Scores on the two tests are then 
equated as though they were parallel forms of the same 
test, and a conversion table is developed relating scores 
on each test to scores on the other. 

One of the principal difficulties with the foregoing 
procedure, however, is that the judgment "equally bilin- 
gual and bicultural" is extremely difficult, perhaps even 
impossible, to make. More than likely, the individual 
members of the group, and even the group as a whole, 
will on average be more proficient in one of the two 
languages than in the other. This will be especially true, 
of course, if the group is small. 

This study represents an attempt to overcome such
difficulties. In brief, it calls for administering the PAA to
Puerto Rican students and the SAT to mainland United
States students, using a set of "common," or anchor,
items to calibrate and adjust for any differences between
the groups in the process of equating the two
tests. It is noted that these items are common only in
terms of the operations used to develop and select
them. By the very nature of things they had to be administered
in Spanish to the Puerto Rican students and in
English to the mainland students. Therefore, to the extent
that there is any validity in the notion that a set of
test items can represent the same psychological task to
individuals of two different languages and cultures, to
the extent that the sense of the operations is acceptable,
and to the extent that the operations themselves were
adequate, the study will have achieved its purpose.
There is also the concern that the Puerto Rican and the
mainland groups appear to differ so greatly in average
ability that with the limited equating techniques available,
it is not likely that any set of common items,
however appropriate, can make adequate adjustments
for the differences, even if the two tests were designed
for students of the same language and culture.

There is, finally, the concern about the generalizability
of a conversion between tests that are appropriate
for different cultural groups. In the usual equating problem,
a conversion function is sought that will simply
translate scores on one form of the test to the score scale
of a parallel form of the test, an operation analogous to
that of translating Fahrenheit units of temperature to
Celsius units. When the two tests in question are measuring
different types of abilities, however, or when one or
both of the tests may be unequally appropriate for different
subgroups of the population, the conversion cannot
be unitary, as would be true of the temperature-scale
conversion, but would be different for different subgroups
(Angoff 1966). In the present equating attempt,
it is entirely possible that the use of different types of
subgroups for the equating experiment (Mexicans and
Australians, for example, instead of Puerto Ricans and
United States mainlanders) would yield conversion
functions quite different from those developed in the
present study. For this reason the conversions developed
here should be considered to have limited applicability
and should not be used without verification with groups
of individuals different from those studied here.



METHOD 

In broad outline the method followed in this study for
deriving conversions of scores from the verbal and
mathematical scales of the PAA to the verbal and mathematical
scales of the SAT was the same as that followed
in the Angoff-Modu (1973) study referred to above. As
carried out previously, this study was conducted in two
phases: The first phase entailed the selection of appropriate
anchor items for equating. This phase called for
the preparation of sets of items both in Spanish and in
English, the translation of each set into the other language
by Puerto Rican educators proficient in both languages,
and the administration of both sets in the appropriate
language mode to Spanish- and English-speaking
students. On the basis of an item analysis of the data
resulting from this administration, groups of verbal and
mathematical items were chosen to fulfill the principal
requirement that they be equally appropriate, insofar
as this could be determined, for both student groups.
Beyond this requirement, the usual criteria for the
choice of equating items as to difficulty, discrimination,
and content coverage were adhered to, to the extent
possible. In addition, care was taken, also where possible,
to produce sets of anchor items reasonably balanced
as to Spanish or English origin. Once the anchor
items were chosen, the second phase of the study was
undertaken, which called for a second test administration
and an analysis for equating based on the data
resulting from that administration.

Phase 1: Selection of Items for Equating

In accordance with the foregoing plan, 105 Spanish verbal
items, 110 English verbal items, 62 Spanish mathematical
items, and 62 English mathematical items were
drawn from the file and submitted to bilingual experts
in Puerto Rico for translation. Two experts were assigned
to translate the Spanish verbal items into English,
and two other experts were assigned to translate
the English verbal items into Spanish. After this initial
translation was completed, the two sets of experts independently
back-translated each other's work into the
original language. All translations were then sent to
Educational Testing Service, where Spanish-language
experts compared the back-translated items with the
original items. Adjustments were made in the initial
translations by the ETS staff and by the staff of the College
Board Puerto Rico Office when the comparisons
revealed inadequacies. In some instances it was judged
that revisions could not be made adequately, and as a
result a number of items were dropped from further
use. The same process was carried out for the mathematical
items. Because of the smaller number of mathematical
items, however, only two translators were
used: one for translating items from Spanish to English
and the other, items from English to Spanish. Eventually
two complete sets of items were compiled, 160 verbal
and 100 mathematical; each set appeared in both
languages and, to the extent that this could be observed
at an editorial level, was equally meaningful in both
languages.

The 160 verbal items were of four types, paralleling
the item types normally appearing in the operational
forms of the PAA and the SAT: antonyms, analogies,
sentence completion, and reading comprehension.
The 100 mathematical items fell into four content categories:
arithmetic, algebra, geometry, and miscellaneous.
Detailed quantitative information on the pretested
items is given later in this report.

The 160 verbal items and the 100 mathematical
items were subdivided into four 40-item verbal sets and
four 25-item mathematical sets and administered to spiraled
samples of regular College Board examinees. The
test items in English were taken by candidates for the
English-language SAT at the January 1985 administration;
the same test items, in Spanish, were taken by
candidates for the Spanish-language PAA at the October
1984 administration. All of the foregoing sets of items
were administered in 30-minute periods. All SAT and
PAA examinee samples consisted of about 2,000 cases.

Item response theory (IRT) methods were used to
compare performance on the verbal and mathematical
items by the SAT and PAA groups. Items that functioned
most similarly for the two groups were selected to constitute
the 40-item verbal and 25-item mathematical
equating tests.

Of all the methods currently available for selecting
items that function similarly for two groups of
examinees, the three-parameter IRT method (Lord
1977; Petersen 1977; Shepard, Camilli, and Williams
1984) used in this study is preferable. This is so
because it minimizes effects related to differences in
group performance that seriously confound the results
of simpler procedures such as the delta-plot method
(Angoff and Ford 1973) used in the previous study.

Item response theory methods may be used to compare
two groups of examinees with respect to their responses
to a particular item for the full ability (θ) continuum.
Item characteristic curves (ICCs), such as those
shown in Figure 1, describe the relationship between
the probability of a correct response to an item and the
degree of ability measured by the item. The curves in
Figure 1 are described by the values of three item parameters:
a, b, and c. These parameters have specific
interpretations: b is the point on the θ metric at the
inflection point of the ICC (where the slope of the curve
reaches its maximum and begins to decrease) and is
taken as a measure of item difficulty; a is proportional
to the slope of the ICC at the point of inflection and
represents the degree to which the item provides useful
discriminations among individuals; c is the value of the
lower asymptote of the ICC (where the slope is essentially
zero) and represents the probability that an
examinee with very low ability will obtain a correct
answer to the item.
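
The roles of a, b, and c can be illustrated with a short sketch. It assumes the three-parameter logistic form with the conventional scaling constant D = 1.7; the report does not print the formula, so the function name `icc`, the constant, and the parameter values below are illustrative assumptions, not the study's own code.

```python
import math

def icc(theta, a, b, c, D=1.7):
    """Probability of a correct response under an assumed 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, some guessing.
a, b, c = 1.0, 0.0, 0.2

# At theta == b the curve sits halfway between c and 1 (its inflection point).
p_at_b = icc(b, a, b, c)

# Far below b the probability approaches the lower asymptote c.
p_low = icc(-10.0, a, b, c)

# The slope at the inflection point is proportional to a: D * a * (1 - c) / 4.
h = 1e-5
slope_at_b = (icc(b + h, a, b, c) - icc(b - h, a, b, c)) / (2 * h)
```

Under these assumptions, doubling a doubles the slope at θ = b, while changing b shifts the whole curve along the θ axis without changing its shape.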

Studies of differential item difficulty were undertaken
by estimating the ICCs of the pretested items separately
for the PAA and SAT groups. Theoretically, if the
item has the same meaning for the two groups, the
probability of a correct response should be the same for
examinees of equal ability (i.e., for any value of θ along
the continuum). Panel A of Figure 1 contains a comparison
of item response functions obtained for a verbal
item given to the PAA and SAT groups. It can be seen,
from examination of the ICCs in Panel A, that for all
levels of ability (θ) the PAA group has a higher probability
of obtaining a correct answer to the item; i.e., the
item is seen to function in favor of the PAA group. Panel
B of Figure 1 contains a comparison of ICCs obtained for
a mathematical item given to the PAA and SAT groups. In
contrast to the curves shown in Panel A, the ICCs for the
mathematical item given to the two groups of
examinees are almost identical; i.e., individuals at all
levels of ability in both groups have the same probability
of obtaining a correct answer to the item. The item
favors neither of the two groups.

Figure 1. Plots of item response functions for verbal (Panel A) and
mathematical (Panel B) items given to PAA and SAT groups, illustrating
poor and good agreement between groups.

For this study, item parameters and examinee abilities
were estimated by use of the computer program
LOGIST (Wingersky 1983; Wingersky, Barton, and Lord
1982). LOGIST produces estimates of a, b, and c for each
item and θ for each examinee. Inasmuch as item parameter
estimates for the SAT and PAA groups were obtained
in separate calibrations, it was necessary to introduce
an item parameter scaling step at this point. The
item characteristic curve transformation method developed
by Stocking and Lord (1983) was used for this
purpose.
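
A rough sketch of the idea behind such a scaling step follows. A characteristic-curve method looks for a slope A and intercept B that place one calibration on the other's θ scale so that the test characteristic curves of the common items agree as closely as possible. The 3PL form with D = 1.7, the θ grid, the item values, and the crude coarse-to-fine grid search (standing in for a proper numerical optimizer) are all assumptions made for illustration; this is not the LOGIST or Stocking-Lord code.

```python
import math

def icc(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: sum of ICCs over (a, b, c) triples."""
    return sum(icc(theta, a, b, c) for a, b, c in items)

def scaling_loss(A, B, ref_items, new_items, thetas):
    """Mean squared gap between the reference TCC and the rescaled TCC."""
    rescaled = [(a / A, A * b + B, c) for a, b, c in new_items]
    return sum((tcc(t, ref_items) - tcc(t, rescaled)) ** 2 for t in thetas) / len(thetas)

def fit_scaling(ref_items, new_items, thetas, rounds=6):
    """Coarse-to-fine grid search for A and B (a stand-in for a real optimizer)."""
    A, B, half_width = 1.0, 0.0, 0.5
    for _ in range(rounds):
        grid = [(A + i * half_width / 5, B + j * half_width / 5)
                for i in range(-5, 6) for j in range(-5, 6)]
        grid = [(ga, gb) for ga, gb in grid if ga > 0.05]  # keep the slope positive
        A, B = min(grid, key=lambda ab: scaling_loss(ab[0], ab[1],
                                                     ref_items, new_items, thetas))
        half_width /= 4
    return A, B

# Hypothetical anchor items on the reference scale.
ref_items = [(0.8, -1.0, 0.2), (1.0, 0.0, 0.2), (1.2, 0.5, 0.15), (0.9, 1.5, 0.25)]
# The same items on a second scale, where theta_ref = 1.2 * theta_new - 0.3.
true_A, true_B = 1.2, -0.3
new_items = [(a * true_A, (b - true_B) / true_A, c) for a, b, c in ref_items]

thetas = [-3.0 + 0.25 * k for k in range(25)]
A_hat, B_hat = fit_scaling(ref_items, new_items, thetas)
```

In this toy setup the search recovers the slope and intercept used to generate the second set of parameters. With real calibrations the two curves never agree exactly, and the minimized loss indicates how well a single linear transformation reconciles the two scales.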

The procedure to screen the pretested items for the
analysis of differential item difficulty for the PAA and
SAT groups has been described by Lord (1980, chap.
14). The method entails the following steps as they
were carried out in this study:

1. Data for the combined PAA and SAT groups were
used to obtain estimates of the c parameters for
all the items.

2. Holding c's fixed at these values, a and b item
parameter estimates were obtained separately
for the PAA and SAT groups.

3. Following the scaling of the parameter estimates,
item characteristic curves for the two
groups were compared, and those items that
functioned differently in the two groups were
identified.

4. Items with significantly different ICCs were removed
from the pool of pretested items.

5. Ability estimates were obtained for the combined
PAA and SAT groups, using the reduced set
of items.

6. Holding ability estimates fixed at values obtained
in step 5, a and b parameter estimates
were obtained for all pretested items (including
those removed in step 4).

7. ICCs and estimates of item parameters were compared
for the two groups, and the proposed set
of 40 verbal and 25 mathematical equating
items was chosen.

Step 1, it is noted, calls for combining the data
from the two groups in the calculation of c-parameter
estimates and assuming that these estimates are the
same in both groups. The reason for this practice is that
c-parameter estimates are otherwise often poorly
made, are sometimes even indeterminate, and cause
difficulties in comparing parameter estimates across
groups. The practice does not interfere with testing for
significant differences among the a and b parameter
estimates inasmuch as the null hypothesis of the IRT χ²
test used here (Lord 1980, chap. 14) states that the
values of a, b, and c are the same for the two groups of
interest.

Steps 4 to 6 represent the IRT analogue to criterion
purification procedures used with conventional item-bias
techniques. Lord (1980, chap. 14) has cautioned
that the set of items of interest may not be measuring a
unidimensional trait; thus, it is possible that ability estimates
(θ) as well as the ICCs obtained for the PAA group
may not be strictly comparable to those obtained for
the SAT group. One possible solution is to "purify" the
test by removing the differentially functioning items
and then to use the remaining set of unidimensional
items to reestimate the θ's. Finally, the "purified" set of
ability estimates is used to obtain the set of item parameter
estimates and ICCs (for the total pool of items) being
compared.

Many indices are available for quantifying the differences
between item characteristic curves or item parameter
estimates for two groups of examinees. The
two indices chosen for use in this study were the previously
mentioned IRT χ² (Lord 1980, chap. 14) and the
mean of the absolute difference between the ICCs. (See
Cook, Eignor, and Petersen 1985 for a description of
this statistic.) For each test, verbal and mathematical
items were ranked according to their χ² values. From
the set of items with the smallest χ² values, those with
the smallest values of the mean absolute difference
were chosen. The verbal and mathematical equating
tests were constructed by use of this reduced pool of
items.
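
The two-stage selection described above can be sketched as follows. The report does not say how many items were kept after the χ² ranking, nor exactly how the mean absolute difference was evaluated, so the `shortlist_size` argument, the evenly spaced θ grid, the 3PL form with D = 1.7, and all item values below are hypothetical choices for illustration only.

```python
import math

def icc(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def mean_abs_diff(params_1, params_2, thetas):
    """Mean absolute gap between two groups' ICCs for one item, over a theta grid."""
    return sum(abs(icc(t, *params_1) - icc(t, *params_2)) for t in thetas) / len(thetas)

def select_items(items, shortlist_size, n_select, thetas):
    """Stage 1: keep the items with the smallest chi-square values.
    Stage 2: among those, keep the items with the smallest mean absolute difference."""
    shortlist = sorted(items, key=lambda it: it["chi2"])[:shortlist_size]
    ranked = sorted(shortlist,
                    key=lambda it: mean_abs_diff(it["paa"], it["sat"], thetas))
    return [it["id"] for it in ranked[:n_select]]

thetas = [-3.0 + 0.1 * k for k in range(61)]

# Hypothetical items: (a, b, c) estimates for each group plus a chi-square value.
items = [
    {"id": "v1", "chi2": 1.0,  "paa": (1.0, 0.0, 0.2), "sat": (1.0, 0.0, 0.2)},
    {"id": "v2", "chi2": 2.0,  "paa": (1.0, 0.0, 0.2), "sat": (1.0, 0.5, 0.2)},
    {"id": "v3", "chi2": 3.0,  "paa": (1.0, 0.0, 0.2), "sat": (1.0, 0.1, 0.2)},
    {"id": "v4", "chi2": 50.0, "paa": (1.0, 0.0, 0.2), "sat": (1.0, 0.0, 0.2)},
]

chosen = select_items(items, shortlist_size=3, n_select=2, thetas=thetas)
```

In this toy pool, item v4 is excluded at the first stage because of its large χ² value, and v2 is excluded at the second stage because its two curves are farther apart than those of v1 and v3.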

Summary statistics for all pretested items and for
the items chosen to constitute the verbal and mathematical
equating tests are presented in Figures 2 and 3 and
in Tables 1 to 3. Several points should be noted. First, 2
verbal items and 1 mathematical item were eliminated
from scoring before the LOGIST calibration. Second, it
was not possible to obtain item parameter estimates for
13 verbal items and 6 mathematical items. Finally, 3
verbal and 2 mathematical items were found to be so
easy for both groups that stable r-biserial correlation
coefficients could not be assured. These items were removed
from the study. As a result, the pretested item
pools were reduced to 142 verbal and 91 mathematical
items.
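
The item counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers stated in the text; the dictionary and function names are illustrative.

```python
# Counts as stated in the text, by test section.
pretested = {"verbal": 160, "mathematical": 100}
dropped_before_calibration = {"verbal": 2, "mathematical": 1}
no_parameter_estimates = {"verbal": 13, "mathematical": 6}
too_easy_for_r_biserial = {"verbal": 3, "mathematical": 2}

def calibrated(section):
    """Items that survived scoring and yielded parameter estimates."""
    return (pretested[section]
            - dropped_before_calibration[section]
            - no_parameter_estimates[section])

def remaining(section):
    """Items left after also removing those too easy for stable r-biserials."""
    return calibrated(section) - too_easy_for_r_biserial[section]
```

The intermediate counts (145 verbal, 93 mathematical) match the totals cited in the footnote to Table 1, and the final counts match the 142 verbal and 91 mathematical items reported here.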

Figure 2. Plot of b's for pretested verbal items (number of items = 142).

Figure 2 is a bivariate picture in which the 142 b's for
the Spanish verbal items are plotted against the corresponding
b's for the same verbal items as they appeared
in English. Figure 3 gives a similar plot for the mathematical
items. As may be seen from these figures, the
verbal plot is much more dispersed than the mathematical
plot is, representing a much lower correlation for
verbal items (r = .66) than for mathematical items (r =
.90). In general, the correlation between the b's may be
regarded as a measure of item-by-group interaction,
i.e., the degree to which the items represent, or fail to
represent, the same rank order of difficulty in the two
languages. In those instances where the two groups are
drawn at random from the same general population, it
is not unusual to see correlations between item difficulty
indices in the neighborhood of .98 and even
higher. That the correlation for the verbal items in this
study is as low as it is indicates that the verbal items do
not have quite the same psychological meaning for the
members of the two language groups. Mathematics, on
the other hand, with its much higher correlation, appears
to be a more nearly universal language. In a
sense, this is one of the more significant findings in this
study because it reflects the very nature of the difficulties
that are likely to be encountered in cross-cultural
studies, especially when verbal tests are used. With respect
to this study in particular, some doubt is cast on
the quality of any equating that could be carried out
with tests in these two languages and with groups as
different as these. Since the equating items are used to
calibrate for differences in the abilities of the PAA and
SAT groups, a basic requirement for equating is that the
items have the same rank order of difficulty in the two
groups; for the verbal items, it is clear that this requirement
is not met. Considerable improvement, in the
sense of reducing the item-by-group interaction, was
achieved in the verbal items (as will be shown below) by
discarding the most aberrant items among them and
retaining those that showed the smallest differences between
the two language groups in their item response
curves. Nevertheless, with item-by-group interaction effects
even as large as those observed here for the items
that were retained, the concern remains that the verbal
equating might be much less trustworthy than would be
expected of an equating of two parallel tests intended
for members of the same language-culture.
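
To make the link between the correlation of the b's and item-by-group interaction concrete, here is a small sketch with made-up difficulty values. It shows that a uniform shift in difficulty between groups leaves the correlation at 1, whereas a change in the ordering of the items lowers it. The five-item vectors are hypothetical and are not taken from the study's data.

```python
import math

def pearson_r(x, y):
    """Plain Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical b estimates for five items in one language group.
b_group_1 = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Same items, every one uniformly harder by 0.2: no interaction.
b_shifted = [-0.8, -0.3, 0.2, 0.7, 1.2]

# Same difficulty values assigned to different items: strong interaction.
b_reordered = [0.2, -0.8, 1.2, -0.3, 0.7]

r_shifted = pearson_r(b_group_1, b_shifted)
r_reordered = pearson_r(b_group_1, b_reordered)
```

A difference in overall level between the groups is therefore not what depresses the correlation; what matters is whether the items keep the same relative difficulty in both languages.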

It bears repetition, however, that these interactions
were not entirely unexpected; the observation has
often been made that verbal material, however well it
may be translated into another language, loses many of
its subtleties in the translation process. Even for mathematical
items some shift in the order of item difficulty is
to be expected, possibly because of differences between
Puerto Rico and the United States with respect to the
organization and emphasis of the mathematics curriculum
in the early grades. But as has already been pointed
out, the item-by-group interaction is much less for the
mathematical items than for the verbal items.

Figure 3. Plot of b's for pretested mathematical items (number of items = 91).

In Table 1 there is a summary of indices of difficulty
(p-values) and discrimination (r-biserials) for the
pretested items, as observed in the PAA group and in the
SAT group. They are presented separately for the verbal
and mathematical items and, within those categories,
separately by each item's language of origin. There is
also a summary of those indices for the 39 verbal and 25
mathematical items planned for the equating. (It should
be mentioned here that the original plan was to select
40 verbal items and 25 mathematical items. After the
selections were completed, however, it was felt necessary
to revise one of the verbal items substantially. Instead
it was dropped, reducing the number of verbal
equating items from 40 to 39.) The tenth and eleventh
columns in Table 1 give the means and standard deviations
of the index of discrepancy (the mean absolute
difference) between the two item response curves, one
of the indices used as the basis for selecting the items.
Finally, Table 1 gives the correlations between the b-parameter
estimates for the two language groups, again
by category of item.

As can be seen in Table 1, the items are, on average,
considerably more difficult for the PAA candidates
than for the SAT candidates. The difference between the
mean p-values on the verbal items is more than one-half
of a standard deviation; the difference on the mathematical
items is considerably more than a full standard
deviation. For both the PAA and the SAT candidate
groups and for both the verbal and the mathematical
items, the items originating in Spanish appeared to be
relatively easier than those originating in English.
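
The size of these differences can be reproduced from the "All verbal items" and "All mathematical items" rows of Table 1. The report does not say which standard deviation was used as the yardstick, so the pooling of the two groups' SDs below is an assumption, and the function name is illustrative.

```python
def standardized_gap(mean_paa, mean_sat, sd_paa, sd_sat):
    """Difference in mean p-values, expressed in pooled standard deviation units."""
    pooled_sd = ((sd_paa ** 2 + sd_sat ** 2) / 2) ** 0.5
    return (mean_sat - mean_paa) / pooled_sd

# Values from Table 1: mean p and SD of p for each group.
verbal_gap = standardized_gap(mean_paa=0.46, mean_sat=0.58, sd_paa=0.21, sd_sat=0.23)
math_gap = standardized_gap(mean_paa=0.31, mean_sat=0.58, sd_paa=0.19, sd_sat=0.19)
```

Under this assumption the verbal gap comes to roughly 0.54 standard deviations and the mathematical gap to roughly 1.42, consistent with "more than one-half" and "considerably more than a full" standard deviation.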

The second set of Table 1 columns summarizes the
item-test correlations (r-biserials) for the items in their
Spanish and English forms. In general, both verbal and
mathematical items appear to be less discriminating for
the PAA candidates than for the SAT candidates, particularly
so for the mathematical items. This difference in
discrimination is also present in the group of selected
items. It is observed that the mathematical items, at
least for the SAT group, have higher mean r-biserial
correlations on average than do the verbal items, an
observation that is frequently made in other reviews of
these two item types.

As can be seen in the column summarizing the 
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Table 1. Summary Statistics for Pretested Items, by Language of Origin, before and after Selection of Equating Items 



DifficuUy Values (p) 



Item-Test Correlations (r^^^ 



Mean 



SD 



Mean 



SD 



Discrepancy 
Indices 



Correlations 



All Pretest Items 


Items* 


PAA 


SAT 


PAA 


SAT 


PAA 


SAT 


PAA 


SAT 






U 5 


Verbal 


























Oricinallv Enoli^h 


74 


.43 








n 
.jf 


.4/ 






.13 


1 1 
.11 


.61 


Originally Spanish 


68 


.50 


.63 


.21 


.21 


.41 


.42 


.13 


.14 


.13 


.11 


.66 


All verbal items 


142 


.46 


,58 


.21 


.23 


.39 


.45 


.14 


.12 


.13 


.11 


.66 


Mathematical 


























Originally English 


44 


.28 


.54 


.17 


.18 


.36 


.57 


44 


.09 


.04 


.03 


.96 


Originally Spanish 


47 


.33 


.62 


.20 


.20 


.43 


.60 


.16 


.11 


.05 


.04 


.89 


All mathematical items 


91 


.31 


.58 


.19 


.19 


.40 


.59 


.15 


.10 


.04 


.03 


.90 


Itc'ns selected 


























for equating 


























Verbal 


39 


.43 


.60 


.22 


.23 


.37 


.47 


.15 


.09 


.06 


.03 


.96 


Mathematical 


25 


.28 


.56 


.15 


.17 


.40 


.60 


.11 


.07 


.03 


.01 


.99 



*Three of the 145 verbal items and two of the 93 mathematical items were so easy for both groups that stable r biserial correlation eoefficients for these items 
eould not be assured. Consequently these indiees were not ealculated for the items in question. 



Table 2. Summary Statistics for Pretested Items, by Item Type

                                    Difficulty Values (p)     Item-Test Correlations (r_bis)   Discrepancy    Correlations
                                    Mean         SD           Mean         SD                  Indices        between
All Pretest Items           Items*  PAA   SAT    PAA   SAT    PAA   SAT    PAA   SAT           Mean   SD      b's

Verbal
  Antonyms                  43      .44   .47    .22   .23    .37   .43    .13   .13           .18    .13     .59
  Analogies                 34      .41   .59    .19   .24    .42   .45    .15   .12           .13    .10     .62
  Sentence completion       29      .49   .63    .25   .23    .36   .45    .14   .12           .15    .11     .73
  Reading comprehension     36      .51   .66    .16   .17    .41   .45    .12   .13           .08    .06     .75
Mathematical
  Arithmetic                21      .29   .58    .16   .20    .42   .60    .19   .10           .05    .03     .93
  Algebra                   37      .34   .62    .20   .19    .40   .58    .13   .11           .04    .04     .96
  Geometry                  26      .28   .54    .19   .19    .38   .60    .14   .09           .05    .03     .92
  Miscellaneous             7       .29   .54    .16   .23    .35   .53    .18   .07           .03    .01     .99

*Three of the 145 verbal items and two of the 93 mathematical items were so easy for both groups that stable r-biserial correlation coefficients for these items
could not be assured. Consequently these indices were not calculated for the items in question.


discrepancies between the item response curves for the
two groups, the discrepancies between the two curves
for the verbal items are far greater than for the
mathematical items. Also, it is observed that the items
selected for equating show smaller mean discrepancies
than are observed in the entire groups of pretested items.
This is to be expected, of course, since the items were
selected largely on the basis of the agreement between
the two item response curves.

The last column, giving the correlations between
the b's, expresses in correlational terms what has
already been observed in Figures 2 and 3, namely, the
item-by-group interaction with respect to item
difficulty. Here we see again that the correlation between
the b parameters is much lower for the verbal items
than for the mathematical items. And again, we see
that the correlations between the b-values for the
selected items, especially the verbal items, are higher
than for the unselected items.

Table 2 is a summary of the same data as shown in
Table 1 but classified by item type rather than by
language of origin. The great difficulty of the items for the
PAA group is readily observable in this table. It is also
clear that the items in all four verbal and mathematical
categories are more discriminating for the United
States students than for the Puerto Rican students.

It is interesting that the four verbal types arrange 
themselves into two distinct classes insofar as the correlations
between their b-values are concerned: higher
correlations (smaller item-by-group interactions) are 
characteristic of the sentence completion and reading 
comprehension items, and lower correlations (larger 
item-by-group interactions) are characteristic of the an- 
tonyms and analogies. This result is intuitively reason- 
able since items with more context probably tend to 
retain their meaning, even in the face of translation into 
another language. 

Although the item groups are too small to permit 
easy generalization, it appears that there is consider- 
able and, very likely, significant variation from one 
verbal item type to another with respect to the similar- 
ity of the item response curves for the two candidate 
groups. (No such interaction is observed in the mathematical
items.) The analogy items especially, and to
some considerable extent the sentence completion and 
reading comprehension items, were more difficult rela- 
tive to antonyms for the Puerto Rican students than 
for the mainland United States students. This appears 
to be a subtle effect, very likely characteristic of the 
item type itself. It is certainly not a function of the 
origin of these items and their increased relative diffi- 
culty upon translation into the other language. As 
shown in Table 3, very nearly the same proportion of 
items for each of the categories was drawn from each 
language. 



Table 3. Distribution of Pretested Items, by Item Type
and Language of Origin

                          English   Spanish   Total

Verbal
  Antonyms                21        22        43
  Analogies               19        15        34
  Sentence completion     16        13        29
  Reading comprehension   18        18        36
  Total                   74        68        142
Mathematical
  Arithmetic              11        10        21
  Algebra                 15        22        37
  Geometry                13        13        26
  Miscellaneous           5         2         7
  Total                   44        47        91



Phase 2: Equating 

Once the 39 verbal and 25 mathematical items that
were to be used as "common" (more properly,
"quasi-common") items were chosen, preparations were
made to administer them to groups of candidates taking
the PAA or the SAT for admission to college. Accordingly,
two samples of candidates were chosen from the
December 1985 administration of the SAT: one to take
the verbal items in English (N = 6,017) in a 30-minute
period, the other to take the mathematical items in
English (N = 6,172), also in a 30-minute period, in
addition to the regular operational form of the SAT
given at that time. Similarly, two samples of candidates
were chosen from the October 1986 administration of
the PAA: one to take the verbal items in Spanish (N =
2,886) in a 30-minute period, the other to take the
mathematical items in Spanish (N = 2,821), also in a
30-minute period, in addition to the regular operational
form of the PAA given at that time. Both the SAT and the
PAA samples were drawn systematically from their
parent candidate groups. The scaled score means for the
SAT samples were 405 verbal and 455 mathematical,
compared with their parent group means of 404 verbal
and 453 mathematical. The scaled score means for the
PAA samples were 466 verbal and 476 mathematical,
compared with their parent group means of 465 verbal
and 485 mathematical. In all instances the sample
means approximated their parent means fairly closely.

Before the PAA verbal and mathematical scores
were equated to the SAT verbal and mathematical
scores, care was taken to evaluate the common items to
determine if they were functioning in the same manner
for the PAA and SAT samples. The evaluation was carried
out by examining plots of item difficulty indices (delta
values¹). Common items in this study were defined as
"equally appropriate" to the Spanish- and
English-speaking groups on the basis of the similarity of their
rank-order position among other items for the two
groups. Five verbal and two mathematical items that
were considered "outliers" from this plot were deleted
from the common-item sections.

Tables 4 and 5 contain information that may be
used to evaluate the extent to which the common
items are, in fact, appropriate for both groups of
examinees and the extent to which the operational
tests are appropriate for their groups. Table 4 contains
frequency distributions and summary statistics for the
verbal operational and equating sections of the SAT
and the PAA. It can be seen, from the verbal equating
data in Table 4, that the mainland sample is the higher
scoring of the two groups by more than a full standard
deviation. The difficulty of the 66-item PAA appears to
be just about right, on average, for the Puerto Rican
sample; the average percentage-pass on that test
(corrected for guessing) was .49. The 85-item SAT is clearly
difficult for the mainland sample; the average
percentage-pass on that test (also corrected for guessing)
was .40.

The patterns of standard deviations and correlations
observed in Table 4 between the equating test in
English and the SAT and between the equating test in
Spanish and the PAA suggest that each of these verbal
equating tests is reasonably parallel in function to the
operational test with which it is paired.

The data presented in Table 5 describe frequency
distributions and summary statistics for the mathematical
operational and equating sections of the SAT and the
PAA. The mathematical equating data in Table 5 reveal
even more sharply than do the verbal equating data in
Table 4 that the mainland sample is the higher-scoring
of the two. The mean difference in the mathematical
common items is about 1.4 standard deviations. Also,
note that, in contrast to the verbal test, the operational
PAA-mathematical test was as difficult for the PAA sample
(percentage-pass, corrected for guessing, was .39)
as was the SAT-mathematical test for the SAT sample
(percentage-pass, corrected for guessing, was .40).
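
The summary figures quoted above can be recovered from the means and standard deviations reported in Tables 4 and 5. The sketch below is illustrative only. It assumes that the percentage-pass corrected for guessing is the mean formula score divided by the number of items, and that the group difference is expressed in units of the unweighted average of the two equating-section standard deviations; the report does not state either computation, and the function names are hypothetical.

```python
# Means, SDs, and item counts transcribed from Tables 4 and 5 of this report.
# (mean formula score, number of items) for each operational test.
OPERATIONAL = {
    "SAT-verbal": (33.80, 85),
    "PAA-verbal": (32.41, 66),
    "SAT-mathematical": (24.17, 60),
    "PAA-mathematical": (19.56, 50),
}


def percentage_pass(mean_formula_score, n_items):
    """Mean formula score as a proportion of the maximum (assumed definition)."""
    return mean_formula_score / n_items


def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Mean difference in units of the average of the two SDs (assumed convention)."""
    return (mean_a - mean_b) / ((sd_a + sd_b) / 2.0)


for name, (mean, n_items) in OPERATIONAL.items():
    print(f"{name}: percentage-pass = {percentage_pass(mean, n_items):.2f}")

# Equating-section statistics: mainland sample first, Puerto Rican sample second.
verbal_gap = standardized_difference(17.72, 7.17, 10.58, 6.56)
math_gap = standardized_difference(10.82, 6.72, 2.92, 4.43)
print(f"verbal equating-section gap: {verbal_gap:.2f} SD")
print(f"mathematical equating-section gap: {math_gap:.2f} SD")
```

Under these assumptions the verbal gap comes out slightly above one standard deviation and the mathematical gap near 1.4, in line with the text.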

Unlike the results shown for the verbal tests in
Table 4, the patterns of standard deviations and correlations
in Table 5 between the SAT and the equating test in
English and between the PAA and the equating test in
Spanish suggest that the equating test may be considered
parallel in function to the SAT but not quite so
parallel to the PAA.



1. The delta index is a transformation of the proportion of the group
who answer an item correctly ($p_+$) to a normal deviate ($z$) and from
$z$ to a scale with a mean of 13 and a standard deviation of 4 by means
of the equation $\Delta = -4z + 13$.
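
The delta transformation in the footnote can be sketched in a few lines. This is a minimal illustration, assuming that z is the standard normal deviate below which a proportion p+ of the distribution lies, so that easier items receive lower delta values; the function name is hypothetical.

```python
from statistics import NormalDist


def delta_index(p_plus):
    """Delta = -4z + 13, with z the standard normal deviate for p_plus (assumed convention)."""
    if not 0.0 < p_plus < 1.0:
        raise ValueError("p_plus must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(p_plus)
    return -4.0 * z + 13.0


# An item answered correctly by half the group sits at the scale mean of 13.
print(delta_index(0.5))
```

Plotting the delta values of the common items for one group against those for the other gives the kind of plot from which outlying items can be identified.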



Two kinds of equating were undertaken, linear and 
curvilinear. Of the several linear methods, two were 
chosen for use; one is attributed to Tucker (in Angoff 
1984, p. 110) and the other to Levine (1955). Two curvi- 
linear methods were used: equipercentile equating (see 
Angoff 1984, p. 97) and item response theory equating 
(Lord 1980, chap. 13). Although the results of all the 
methods were evaluated, only the item response theory 
results were used and consequently are the only method 
and results described in this report. 

Item response theory (IRT) assumes there is a
mathematical function that relates the probability of a correct
response on an item to an examinee's ability. (See Lord
1980 for a detailed discussion.) Many different
mathematical models of this functional relationship are
possible. As mentioned in the preceding section, the model
chosen for this study was the three-parameter logistic
model. In this model the probability (P) of a correct
response to item i for an individual with ability $\theta$, $P_i(\theta)$, is

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-1.7 a_i (\theta - b_i)}} \qquad (1)$$

where $a_i$, $b_i$, and $c_i$ are three parameters describing the
item, and $\theta$ represents an examinee's ability.
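
As an illustration of the three-parameter logistic model, the function below evaluates the probability of a correct response. It is a sketch that assumes the form c + (1 - c) / (1 + exp(-1.7 a (theta - b))), with the conventional scaling constant 1.7 following Lord (1980); the function name and the example parameter values are hypothetical.

```python
import math


def p_correct(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response.

    a: discrimination, b: difficulty, c: lower asymptote (pseudo-guessing).
    scale: logistic scaling constant (1.7 is an assumption, following Lord 1980).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


# When ability equals the item difficulty, the probability is halfway
# between the lower asymptote c and 1.
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2))
```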

The IRT equating method used in this study is
referred to as IRT concurrent equating. (See Petersen,
Cook, and Stocking 1983; also Cook and Eignor 1983
for a discussion of several IRT equating methods.) For
IRT concurrent equating, all item parameter estimates
for old and new test editions are calibrated in a single
LOGIST run. This process results in item parameters
expressed on a common scale and allows direct equating
of the new and the old test editions.

Once item parameter estimates on a common scale
have been obtained, there are a number of IRT equating
procedures that may be used. This study, however, was
concerned only with true formula score equating (Lord
1980). The expected value of an examinee's observed
formula score is defined as his or her true formula
score. For the true formula score, $\xi$, we have

$$\xi = \sum_{i=1}^{n} \left[ \frac{k_i + 1}{k_i} P_i(\theta) - \frac{1}{k_i} \right] \qquad (2)$$

where n is the number of items in the test and $(k_i + 1)$ is
the number of choices for item i. If we have two tests
measuring the same ability $\theta$, then true formula scores $\xi$
and $\eta$ from the two tests are related by Equation (2),
given above, and Equation (3):

$$\eta = \sum_{j=1}^{n} \left[ \frac{k_j + 1}{k_j} P_j(\theta) - \frac{1}{k_j} \right] \qquad (3)$$

where Equation (3) parallels Equation (2), but for
items j (= 1, ..., n). Clearly, for a particular $\theta$,
corresponding true scores $\xi$ and $\eta$ have identical meaning.
They are thus said to be equated.

Table 4. Frequency Distributions and Summary Statistics for Verbal Operational and Equating
Sections of the SAT and PAA

                          Mainland Sample             Puerto Rican Sample
Raw                       Operational   Equating      Operational   Equating
(Formula) Score           SAT           Section       PAA           Section

81-83                     3
78-80                     16
75-77                     23
72-74                     28
69-71                     39
66-68                     67
63-65                     87
60-62                     122                         15
57-59                     137                         33
54-56                     185                         90
51-53                     217                         83
48-50                     277                         141
45-47                     333                         193
42-44                     333                         201
39-41                     363                         241
36-38                     416                         209
33-35                     459           44            239
30-32                     441           216           260           5
27-29                     405           495           219           31
24-26                     404           625           233           47
21-23                     362           804           164           126
18-20                     327           939           152           244
15-17                     263           874           166           337
12-14                     245           863           94            502
9-11                      190           500           80            424
6-8                       128           345           45            459
3-5                       78            212           20            400
0-2                       29            71            7             209
-3 to -1                  27            26            1             102
-6 to -4                  9             2
-9 to -7                  4

Number of cases           6,017         6,017         2,886         2,886
Mean                      33.80         17.72         32.41         10.58
SD                        15.91         7.17          12.64         6.56
Correlation:
  Operational vs. Equating       .841                        .806
Number of items           85            34            66            34

Because true formula scores below the chance-score
level are undefined for the three-parameter logistic
model, some method must be established to obtain a
relationship between scores below the chance level on
the two test forms to be equated. The approach used
for this study was to estimate the mean and the standard
deviation of below-chance-level scores on the two
tests to be equated (see Lord 1980). Then these
estimates were used to do a simple linear equating between
the two sets of below-chance-level scores.
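
A simple linear equating of the kind mentioned here sets scores on the two tests equal when they lie the same number of standard deviations from their respective means. The sketch below illustrates only that final step. The means and standard deviations passed in are hypothetical placeholders, not values from this study, and the estimation of the below-chance moments (Lord 1980) is not shown.

```python
def linear_equate(score_x, mean_x, sd_x, mean_y, sd_y):
    """Map a score on test X to the test Y scale by matching standard scores."""
    if sd_x <= 0 or sd_y <= 0:
        raise ValueError("standard deviations must be positive")
    return mean_y + (sd_y / sd_x) * (score_x - mean_x)


# Hypothetical below-chance-level moments for two tests, for illustration only.
equivalent = linear_equate(score_x=-2.0, mean_x=-1.5, sd_x=1.2, mean_y=-2.0, sd_y=1.6)
print(equivalent)
```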



In practice, true score equating is carried out by
substituting estimated parameters into Equations (2)
and (3). Paired values of $\xi$ and $\eta$ are then computed for
a series of arbitrary values of $\theta$. Since we cannot know
an examinee's true formula score, we proceed as if
relationships (2) and (3) apply to the examinee's observed
formula score.
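
The procedure just described can be sketched as follows: compute the true formula score on each test over a grid of ability values, then read off the score on one test that is paired with a given score on the other. This is an illustration under stated assumptions, not the LOGIST-based implementation used in the study. The item parameters are hypothetical, the logistic scaling constant is taken to be 1.7, and linear interpolation between grid points is an added convenience.

```python
import math


def p_correct(theta, a, b, c):
    """Three-parameter logistic model (scaling constant 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def true_formula_score(theta, items):
    """Sum over items of [(k + 1) / k] * P(theta) - 1 / k.

    items: list of (a, b, c, k) tuples, where each item has k + 1 choices.
    """
    return sum((k + 1) / k * p_correct(theta, a, b, c) - 1.0 / k
               for a, b, c, k in items)


def equating_pairs(items_x, items_y, thetas):
    """Paired true formula scores on tests X and Y for each ability value."""
    return [(true_formula_score(t, items_x), true_formula_score(t, items_y))
            for t in thetas]


def convert(score_x, pairs):
    """Interpolate the test Y score paired with score_x (pairs ascending in X)."""
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if x0 <= score_x <= x1:
            weight = (score_x - x0) / (x1 - x0)
            return y0 + weight * (y1 - y0)
    raise ValueError("score lies outside the tabulated range")


# Hypothetical three-item tests; the second has uniformly harder items.
easier_test = [(1.0, -1.0, 0.2, 4), (0.8, -0.5, 0.2, 4), (1.2, 0.0, 0.2, 4)]
harder_test = [(1.0, 0.5, 0.2, 4), (0.8, 1.0, 0.2, 4), (1.2, 1.5, 0.2, 4)]
thetas = [step / 10.0 for step in range(-40, 41)]
pairs = equating_pairs(easier_test, harder_test, thetas)

print(convert(1.5, pairs))
```

Because the second hypothetical test is harder at every ability level, a given true score on the easier test converts to a lower true score on the harder one.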

The final outcome of the IRT equating of the PAA
verbal and mathematical tests to the SAT verbal and
mathematical tests was two conversion tables; one table
relates raw scores on the PAA-verbal to raw scores on
the SAT-verbal, and the second table relates raw scores
Table 5. Frequency Distributions and Summary Statistics for Mathematical Operational and
Equating Sections of the SAT and PAA

                          Mainland Sample             Puerto Rican Sample
Raw                       Operational   Equating      Operational   Equating
(Formula) Score           SAT           Section       PAA           Section

59-60                     19
57-58                     14
55-56                     29
53-54                     33
51-52                     56
49-50                     74                          5
47-48                     79                          9
45-46                     106                         24
43-44                     101                         28
41-42                     160                         49
39-40                     160                         44
37-38                     211                         51
35-36                     248                         92
33-34                     [?]                         80
31-32                     [?]                         116
29-30                     [?]                         92
27-28                     [?]                         153
25-26                     [?]                         121
23-24                     410           206           159           2
21-22                     [?]           [?]           190           10
19-20                     341           320           183           10
17-18                     370           [?]           201           25
15-16                     321           439           178           27
13-14                     [?]           [?]           229           40
11-12                     248           613           205           75
9-10                      257           556           183           119
7-8                       200           677           165           172
5-6                       187           554           90            295
3-4                       136           571           90            444
1-2                       64            460           47            733
-1 to 0                   47            188           29            550
-3 to -2                  15            82            4             272
-5 to -4                  7             4             2             45
-7 to -6                  1                           2             2

Number of cases           6,172         6,172         2,821         2,821
Mean                      24.17         10.82         19.56         2.92
SD                        12.48         6.72          10.71         4.43
Correlation:
  Operational vs. Equating       .879                        .740
Number of items           60            23            50            23

[?] Entry illegible in the source.

on the PAA-mathematical to raw scores on the
SAT-mathematical. Conversion tables showing the
relationship between scaled scores on the respective PAA and
SAT tests were derived from the raw-to-raw conversion
tables. Scaled score conversions for the verbal and
mathematical tests are presented in Table 6. It is clear
from the Table 6 list of verbal equivalences that the
difference between the two scales is as much as 180 to
185 points at a PAA score of 500. The differences are
smaller at the extremes of the score scale.



The equivalences for the mathematical tests also
show striking differences between the PAA and the SAT
scales. In the vicinity of a PAA-mathematical score of
500 there is also a difference of 180 to 185 points. As is
the case for the verbal equivalences, the differences are
smaller at the extremes of the score scale.
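
The size of the difference near a PAA score of 500 can be checked directly against Table 6. The sketch below uses the four rows of each conversion that bracket 500 and interpolates linearly between adjacent rows. The interpolation is an assumption made for illustration, since the table reports equivalents only at the listed scores, and the function name is hypothetical.

```python
# (PAA scaled score, equivalent SAT scaled score), transcribed from Table 6.
VERBAL = [(487, 307), (495, 313), (503, 319), (511, 326)]
MATHEMATICAL = [(480, 308), (491, 313), (502, 318), (512, 323)]


def sat_equivalent(paa_score, table):
    """Linearly interpolate the SAT equivalent between adjacent table rows."""
    for (paa_lo, sat_lo), (paa_hi, sat_hi) in zip(table, table[1:]):
        if paa_lo <= paa_score <= paa_hi:
            fraction = (paa_score - paa_lo) / (paa_hi - paa_lo)
            return sat_lo + fraction * (sat_hi - sat_lo)
    raise ValueError("PAA score lies outside the rows supplied")


for label, table in (("verbal", VERBAL), ("mathematical", MATHEMATICAL)):
    equivalent = sat_equivalent(500, table)
    print(f"{label}: PAA 500 -> SAT {equivalent:.1f} (difference {500 - equivalent:.1f})")
```

Both interpolated differences fall within the 180-to-185-point range cited in the text.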

Graphs of the verbal and mathematical IRT equating
results appear in Figures 4 and 5. It is evident, even
from a cursory glance at these figures, that they suggest
markedly curvilinear conversions between the PAA and
SAT, typical of the results of equating two tests that
differ pronouncedly in difficulty. Such conversions,
which are likely to be very nearly the same irrespective
of the particular method of equating employed in
producing them, are seen to be concave upward when the
easier test is plotted on the horizontal axis and the more
difficult test on the vertical axis. In this instance the PAA
is clearly the easier test, and, inasmuch as the concavity
is deeper for the mathematical test, it appears also
that the mathematical tests are more different in
difficulty than the verbal tests.

Some attention should be given to the meaning of
the differences in the PAA and SAT scales. That a 500
score on the PAA corresponds to a lower-than-500 score
on the SAT simply says that if one can assume that the
SAT and PAA values have been maintained precisely
since the time of their inception, it can be concluded
that the original scaling group for the SAT was generally
more able in the abilities measured by these aptitude
tests than was the original scaling group for the PAA. It
does not by itself imply that the SAT candidate group
today is necessarily more able than the PAA group,
although this appears, in fact, to be the case. Nor does it
necessarily suggest any generalization regarding the
large populations from which these two examinee
groups were self-selected (e.g., that the twelfth-grade
students on the mainland score higher than do the
twelfth graders in Puerto Rico). We know that the SAT
examinee group is about one-third the size of the
twelfth-grade population on the mainland and is
therefore a more selective group than is its PAA counterpart,
which represents a substantial proportion (over 95
percent) of the twelfth-grade population in Puerto Rico.
On the other hand, this is not to say that differences
between the two twelfth-grade populations do not also
exist. There is some evidence, however crude, that
marked differences do exist. But this evidence is
outside the scope of this study.

In view of these and other possible misinterpretations
of the data of this report, it will be useful to
restate the limited purpose for which the present
investigation was undertaken: to derive a set of conversions
between two similar-appearing scales of measurement,
one for tests of one language and culture, the other for
tests of a different language and culture. Clearly, the
accuracy of these conversions is limited by the
appropriateness of the method used to derive them and the data
assembled during the course of the study. It is hoped
that these conversions will be useful in a variety of
contexts, but (as suggested by the examples cited here)
to be useful, they will need in each instance to be
supported by additional data peculiar to the context.

A natural question that would arise at this point is,
How well do the equivalences developed in this study
compare with those developed in the 1973
Angoff-Modu study? In the earlier study, it is recalled, two
linear methods were used in addition to a curvilinear
method. The final conversions reported there were
Table 6. Final Conversions between PAA and SAT

Verbal*                                 Mathematical*
PAA              Equivalent             PAA              Equivalent
Scaled Scores    SAT Scaled Scores      Scaled Scores    SAT Scaled Scores

800              785                    800              785
787              757                    790              743
774              709                    779              676
761              660                    768              629
749              617                    758              593
736              584                    747              564
723              557                    736              539
710              535                    726              518
697              516                    715              499
684              500                    704              482
672              485                    694              467
659              472                    683              453
646              460                    672              440
633              449                    662              429
625              438                    651              418
617              428                    640              408
609              419                    630              399
601              410                    619              390
592              401                    608              382
584              393                    598              374
576              384                    587              366
568              376                    576              359
560              369                    566              353
552              361                    555              346
544              354                    544              340
535              347                    534              334
527              340                    523              329
519              333                    512              323
511              326                    502              318
503              319                    491              313

*Scaled scores are not normally reported higher than 800 or lower than 200 on either
the PAA or the SAT. Some scores below 200 are reported here to show the nature of
the conversions near the ends of the scale.

Note: Care should be exercised in the proper use and interpretation of
Table 6. See the text of this report, beginning with the second paragraph
on page 16 and continuing through page 17, for a discussion of the
limitations of Table 6 and for cautions regarding its possible misuses.



taken to be an average of the three, essentially
weighting the curvilinear results equally with the average of
the two linear results. In the present study, with the
benefit of improved (item response theory) techniques
for equating and with somewhat greater understanding
of equating theory, it was decided to base the entire
operation on the curvilinear equating as determined by
the IRT procedure. The results of this study yielded
substantially lower conversions to the SAT-verbal scale than
was the case in the earlier study, especially in the large
middle range between about 450 and about 750. The
conversions to the SAT-mathematical scale showed
better agreement with the earlier results. One can only
speculate regarding the reasons for the agreement in
the mathematical and the disagreement in the verbal.
Part of it is undoubtedly attributable to a change in
equating method. Another reason is the possibility of
drift in the equating of either the PAA-verbal scale or the
SAT-verbal scale, or both, over the intervening 12 to 15
years, causing a difference between the present results
and those found in the earlier study. Yet another
reason, as has been discussed in other places in this report,
is that verbal equating across two languages and
cultures is so much more problematic than is true of
mathematical equating. In any case, we suggest that for
reasons of improved methodology, the present results are
probably more trustworthy than those given in the
earlier, Angoff-Modu report.
Table 6. Continued

Verbal*                                 Mathematical*
PAA              Equivalent             PAA              Equivalent
Scaled Scores    SAT Scaled Scores      Scaled Scores    SAT Scaled Scores

495              313                    480              308
487              307                    470              303
478              301                    459              299
470              295                    448              295
462              289                    438              290
454              283                    427              286
446              278                    416              282
438              272                    406              278
430              267                    395              274
421              262                    384              269
413              257                    374              265
405              252                    363              261
397              248                    352              257
389              243                    342              253
381              238                    331              249
373              234                    320              245
364              229                    310              241
356              225                    299              237
348              221                    288              232
340              216                    278              228
332              212                    267              224
324              208                    256              220
316              204                    246              216
307              200                    235              212
299              196                    224              209
291              192                    214              205
283              188                    203              197
275              184                    200              188
267              180
259              176
250              172
242              168
234              163
226              158
218              155
210              152
202              150
200              148


Note: Care should be exercised in the proper use and interpretation of 
Table 6. See the text of this report, beginning with the second paragraph 
on page 16 and continuing through page 17, for a discussion of the 
limitations of Table 6 and for cautions regarding its possible misuses. 



SUMMARY AND DISCUSSION 

The purpose of this study was to establish score
equivalences between the College Board Scholastic
Aptitude Test (SAT) and its Spanish-language equivalent, the
College Board Prueba de Aptitud Académica (PAA).
The method of the study involved two phases: (1) the
selection of test items equally appropriate and useful for
Spanish- and English-speaking students for use in
equating the two tests and (2) the equating analysis itself. The
method of the first phase was to choose two sets of items,
one originally appearing in Spanish and the other
originally appearing in English; to translate each set into the
other language; to "back-translate" each set,
independently of the first translation, into the original language;
and to compare the original version of each item with its
twice-translated version and make adjustments in the
translation where necessary and possible. Finally, after
the items were thought to be satisfactorily translated
(some items that appeared to defy adequate translation
were dropped), both sets of "equivalent" items, one in
English and the other in Spanish, were administered,
each in its appropriate language mode, for pretest
purposes. These administrations were conducted in
October 1984 for the PAA group and in January 1985 for the
SAT group; samples of candidates took the PAA or the SAT
at regularly scheduled administrations. They provided
conventional psychometric indices of the difficulty and
discriminating power of each item for each group. In
addition, they provided two item response functions for
each item, one as it appeared in Spanish and was
administered to the Spanish-speaking candidates and one as it
appeared in English and was administered to the
English-speaking candidates. Both functions were
arranged to appear on the same scale so that discrepancies
between them could easily be observed. Finally, indices
of agreement between the functions and measures of
goodness-of-fit of the data to the item response function
were also made available.

On the basis of the analyses of these data, two sets
of items, one verbal and the other mathematical,
were chosen and assembled as "common" items to be
used for equating. In the second, or equating, phase of
the study these common items, appearing both in
Spanish and in English, were administered in the
appropriate language along with the operational form of the SAT
in December 1985 and with the operational form of the
PAA in October 1986. The data resulting from the
administrations of these common items were used to calibrate
for differences in the abilities of the two candidate
groups and permitted equating the two tests by means
of the item response theory method. The final
conversion tables relating the PAA-verbal scores to the
SAT-verbal scores and the PAA-mathematical scores to the
SAT-mathematical scores are given in Table 6. Because
of the scarcity of data at the upper end of the
distribution of PAA scores, score equivalences in that region are
not considered highly reliable.

The general approach followed in conducting this 
study requires special discussion, perhaps all the more 
so because the method is simple, at least in its concep- 
tion. On the other hand, from a psychological view- 
point the task of making cross-cultural comparisons of 
the kind made here is highly complex. In the extreme 
the task is inescapably impossible, and although the 
present study may represent a reasonably successful at- 
tempt, it should be remembered that the cultural differ- 
ences confronted by the study were minimal and rela- 
tively easily bridged. If, for example, the two cultures 
under consideration were very different, then there 
would be little or no common basis for comparison. 

Given, then, that the cultures out of which the tests
in this study were developed are to some extent similar,
and that there is indeed a basis for comparison, the
approach and method offered do appear to have some
likelihood of success. Indeed, the method itself is useful in
providing a type of metric for utilizing the common basis
for comparison. For example, it allows a comparison of
the two cultures only on a common ground, which is to
say only on those items whose item response functions
were relatively similar, items that apparently had the
same "meaning" in both languages and cultures. This
being the case, those characteristics of the two cultures
that make them uniquely different are in essence
removed from consideration in making the comparisons.
Thus, while we are afforded an opportunity to compare
the two cultures on a common basis (i.e., on the items
that are "equally appropriate"), at the same time we are
also afforded an opportunity to examine the differences
in the two cultures in the terms provided by the divergent,
or "unequally appropriate," items. It is noteworthy that
what emerges from this study is that the method
described here also yields a general measure of cultural
similarity, expressed in the index of discrepancy between
the two item response functions. The index (rather, the
reciprocal of the index) summarizes the degree to which
members of the two cultures perceive the item stimulus
similarly. Additional studies of the similarity of any two
cultures would have to be based on other stimuli
examined in a wide variety of different social contexts.

It should also be made clear that the method has its
limitations, as do the results of this study, which has
followed the method. For example, the present study has
relied on the usefulness of translations from each of the
two languages to the other, and the assumption has been
made that biases in translation, if they exist, tend to
balance out. This assumption may not be tenable, however.
Quite possibly translation may be easier and freer of bias
when it is from Language A to Language B than in the
reverse direction, and if items do become somewhat
more difficult in an absolute sense as a result of
translation, this effect would be more keenly felt by speakers of
Language A than by speakers of Language B. Also,
implicit in the method of this study is the assumption that
language mirrors all the significant cultural effects. This
may not be so, and it is possible that the translatability of
words and concepts across two languages does not
accurately reflect the degree of similarity in the cultures
represented by those two languages. If, for example, there are
greater differences in the languages than in the cultures,
then again the method is subject to some bias.

Aside from matters of methodology and possible
sources of bias, a point that has been made earlier in this
report deserves repeating: In this study the comparison
was made between Puerto Rican and mainland United
States students; the resulting conversions between the
PAA and the SAT apply only between these two groups of
students. Whether the same conversions would also
have been found had the study been conducted between
the PAA and the SAT as taken by other Spanish speakers
and other English speakers is an open question. Indeed,
it is an open question whether the conversion obtained
here also applies to variously defined subgroups of the
Puerto Rican and mainland populations: liberal arts
women, engineering men, urban blacks, etc.

It is also to be hoped that the conversions between
the two types of tests will not be used without a clear
recognition of the realities: A Puerto Rican student
with a PAA-verbal score of 503 has been found here to
have an SAT-verbal score "equivalent" of 319. This is
not to say that an SAT-verbal score of 319 would actually
be earned were the student to take the SAT. The stu- 
dent might do better or might do worse, depending, 
obviously, on the student's facility in English. The con- 
versions do offer a way, however, of evaluating a gen- 
eral aptitude for verbal and mathematical materials in 
terms familiar to users of SAT scores; depending on how
well the student can be expected to learn the English 
language, the likelihood of success in competition with 
native English speakers in the continental United States 
can be estimated. Continuing study of the comparative 
validity of the PAA and the SAT for predicting the performance
of Puerto Rican students in mainland colleges is
indispensable to the judicious use of these conversions. 

It will be useful, finally, to describe the ways in 
which the conversions may and may not be used appro- 
priately. A glaring misuse has already been alluded to 
above: It would be entirely inappropriate to conclude 
without further consideration that a student who has 
earned a PAA-verbal score of 503 would therefore earn
an SAT-verbal score of 319, were he or she to take it, 
simply because the table reads that these two scores are 
listed as "equivalent." As already indicated above, the 
student might score lower than 319, depending on his or 
her facility in English. Thus, intelligent use of the table 
requires the additional knowledge of the student's facil- 
ity in English. For this purpose scores on a test like the 
Test of English as a Foreign Language (TOEFL), measuring
the student's understanding of written and spoken
English, would be useful. (Another possibility is a test, 
if one exists, that can accurately predict how rapidly a 
student is likely to learn a foreign language.) If the 
Spanish-speaking student's TOEFL scores are high, indicating
a level of facility in English equivalent to that of
a native speaker of English, these conversions may be 
taken at face value with appropriate cautions for their 
generalizability, as described earlier. If, however, the 
student's English-language ability is not high, the con- 
versions given here will be inapplicable to the degree 
that English is an unfamiliar language to that student. 
Further, it would be expected that inasmuch as the SAT-verbal
test depends more heavily on English language
ability than does the SAT-mathematical test, the verbal
conversion from the PAA to the SAT will therefore be more
sensitive to the inadequacies of the student's knowledge
of English than will be true of the mathematical conversion.
But these guidelines are at best based only on
educated intuition. As already indicated above, the con- 
tinuing conduct of validity studies will yield the best 
guidance for the proper use of these scores. 